My GitHub repository for my assignments can be found at this URL: https://github.com/xinmiaotan/compscix-415-2-assignments
library(ggplot2)
data(mpg)
ggplot(data = mpg)
The plot is empty (only a blank background is drawn), because we did not add a geom layer or map any variables to x and y.
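One way to confirm this is to inspect the plot object itself. The sketch below assumes the `layers` element of a ggplot object is accessible with `$`, which holds in the ggplot2 versions I am aware of; the names `p_empty` and `p_points` are made up for illustration.

```r
library(ggplot2)

# ggplot() alone creates the plot object but adds no layers to draw
p_empty <- ggplot(data = mpg)
length(p_empty$layers)

# Adding a geom with x and y mappings gives the plot something to draw
p_points <- p_empty + geom_point(mapping = aes(x = displ, y = hwy))
length(p_points$layers)
```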
nrow(mpg)
## [1] 234
ncol(mpg)
## [1] 11
The mpg data set has 234 rows and 11 columns.
?mpg
The drv variable describes the type of drive train: f = front-wheel drive, r = rear-wheel drive, 4 = four-wheel drive (4wd).
ggplot(data=mpg, aes(x=hwy , y=cyl)) + geom_point()
ggplot(data=mpg, aes(x=class , y=cyl)) + geom_point()
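class is a categorical variable, so its position on the x axis has no numeric meaning. Because cyl also takes only a few distinct values, many points fall on top of each other, and the scatterplot is not informative.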
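A possible way to make the overlap visible, offered here only as a sketch and not as part of the assignment, is to count the observations at each position. `geom_count()` is a standard ggplot2 geom; the name `class_by_cyl` is made up for illustration.

```r
library(ggplot2)

# Cross-tabulate to see how many cars share each (class, cyl) combination
class_by_cyl <- table(mpg$class, mpg$cyl)
class_by_cyl

# geom_count() sizes each point by the number of overlapping observations
ggplot(data = mpg, aes(x = class, y = cyl)) + geom_count()
```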
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))
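The points are not blue because color = "blue" is inside aes(), so ggplot2 treats "blue" as a data value to be mapped to a color scale. To set the color manually, place the argument outside aes(), as in the following code.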
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
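The difference between mapping and setting can also be checked without looking at the plot. This is a sketch that assumes `ggplot_build()` returns the computed layer data in a `data` list, which is how it behaves in the ggplot2 versions I know; `mapped` and `fixed` are illustrative names.

```r
library(ggplot2)

mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))
fixed <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

# ggplot_build() exposes the data each layer will actually draw
mapped_colours <- unique(ggplot_build(mapped)$data[[1]]$colour)
fixed_colours <- unique(ggplot_build(fixed)$data[[1]]$colour)

mapped_colours  # a colour picked by the default scale, not "blue"
fixed_colours   # the literal colour set outside aes()
```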
?mpg
summary(mpg)
## manufacturer model displ year
## Length:234 Length:234 Min. :1.600 Min. :1999
## Class :character Class :character 1st Qu.:2.400 1st Qu.:1999
## Mode :character Mode :character Median :3.300 Median :2004
## Mean :3.472 Mean :2004
## 3rd Qu.:4.600 3rd Qu.:2008
## Max. :7.000 Max. :2008
## cyl trans drv cty
## Min. :4.000 Length:234 Length:234 Min. : 9.00
## 1st Qu.:4.000 Class :character Class :character 1st Qu.:14.00
## Median :6.000 Mode :character Mode :character Median :17.00
## Mean :5.889 Mean :16.86
## 3rd Qu.:8.000 3rd Qu.:19.00
## Max. :8.000 Max. :35.00
## hwy fl class
## Min. :12.00 Length:234 Length:234
## 1st Qu.:18.00 Class :character Class :character
## Median :24.00 Mode :character Mode :character
## Mean :23.44
## 3rd Qu.:27.00
## Max. :44.00
Categorical variables: manufacturer, model, trans, drv, fl, class. Continuous variables: displ, year, cyl, cty, hwy. This information helps us choose an appropriate type of graph for the data, such as a histogram, bars, points, or lines.
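The same split can be read off the column types instead of the summary() output. This is a small sketch; it treats every numeric column as continuous, which is a simplification (year and cyl take only a few distinct values).

```r
library(ggplot2)

# Flag each column of mpg as numeric or not
is_numeric <- vapply(mpg, is.numeric, logical(1))

numeric_vars <- names(mpg)[is_numeric]
character_vars <- names(mpg)[!is_numeric]

numeric_vars
character_vars
```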
#continuous
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, color = cyl))
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, size = cyl))
#ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, shape = cyl))
#categorical
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, color = class))
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, size = class))
## Warning: Using size for a discrete variable is not advised.
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, shape = class))
## Warning: The shape palette can deal with a maximum of 6 discrete values
## because more than 6 becomes difficult to discriminate; you have 7.
## Consider specifying shapes manually if you must have them.
## Warning: Removed 62 rows containing missing values (geom_point).
Color: For a continuous variable, the darkness of the color shows the change in value. For a categorical variable, a different color represents each category.
Size: For a continuous variable, the size of the point shows the change in value; this is advised. For a categorical variable, a different size represents each category; this is not advised.
Shape: A continuous variable cannot be mapped to shape. For a categorical variable, a different shape represents each category.
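The shape restriction can be demonstrated without stopping the script. The sketch below assumes the error is raised when the plot is built (which is what I observe with `ggplot_build()`); the exact wording of the message may differ between ggplot2 versions, so it is captured and not hard-coded.

```r
library(ggplot2)

# Mapping a continuous variable to shape fails when the plot is built
p_cont <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = cyl))

shape_error <- tryCatch(
  {
    ggplot_build(p_cont)
    NA_character_  # returned only if no error occurred
  },
  error = function(e) conditionMessage(e)
)
shape_error

# Converting cyl to a factor makes it discrete, so shape works
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = factor(cyl)))
```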
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, color = class, shape = class))
## Warning: The shape palette can deal with a maximum of 6 discrete values
## because more than 6 becomes difficult to discriminate; you have 7.
## Consider specifying shapes manually if you must have them.
## Warning: Removed 62 rows containing missing values (geom_point).
Both aesthetics are applied to the same variable: each class gets its own color and its own shape, so the information is shown twice.
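The warning above suggests specifying shapes manually. A minimal sketch of that suggestion follows; the shape codes 1 to 7 are an arbitrary choice for illustration. The count of suv rows appears to match the 62 rows reported as removed, since suv is the seventh class in alphabetical order.

```r
library(ggplot2)

n_classes <- length(unique(mpg$class))
n_suv <- sum(mpg$class == "suv")
n_classes
n_suv

# Supply one shape code per class so that no class is dropped
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = class)) +
  scale_shape_manual(values = seq_len(n_classes))
```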
?geom_point
For shapes that have a border (like 21), you can colour the inside and outside separately. Use the stroke aesthetic to modify the width of the border.
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(shape = 21, colour = "black", fill = "white", size = 2, stroke = 1)
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, color = displ < 5))
The expression displ < 5 is evaluated for every row, giving TRUE or FALSE. The points are then colored by those two groups: displ < 5 and displ >= 5.
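The logical variable that ggplot2 creates here can be computed directly to see the two groups. This is just a sketch; `small_engine` is an illustrative name.

```r
library(ggplot2)

# The same expression used inside aes(), evaluated on its own
small_engine <- mpg$displ < 5

# Number of cars in each group (FALSE: displ >= 5, TRUE: displ < 5)
group_sizes <- table(small_engine)
group_sizes
```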
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow = 2)
Advantages: Faceting shows the relationship between x and y separately for each class, which is more intuitive, and the panels can be compared with each other. Disadvantage: It does not show the overall relationship between x and y across all classes in a single panel.
?facet_wrap
nrow, ncol: the number of rows and columns of panels.
scales: should scales be fixed (“fixed”, the default), free (“free”), or free in one dimension (“free_x”, “free_y”)? By default, the same scales are used for all panels. You can allow scales to vary across the panels with the scales argument. Free scales make it easier to see patterns within each panel, but harder to compare across panels. Example:
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
facet_wrap(~class, scales = "free")
shrink: if TRUE, will shrink scales to fit output of statistics, not raw data. If FALSE, will be range of raw data before statistical summary.
labeller: A function that takes one data frame of labels and returns a list or data frame of character vectors. Example:
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
facet_wrap(c("cyl", "drv"), labeller = "label_both")
To change the order in which the panels appear, change the levels of the underlying factor. Example:
mpg$class2 <- reorder(mpg$class, mpg$displ)
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
facet_wrap(~class2)
as.table: If TRUE, the default, the facets are laid out like a table with highest values at the bottom-right. If FALSE, the facets are laid out like a plot with the highest value at the top-right.
switch: If “x”, the top labels will be displayed at the bottom. If “y”, the right-hand side labels will be displayed on the left. Can also be set to “both”.
drop: If TRUE, the default, all factor levels not used in the data will automatically be dropped. If FALSE, all factor levels will be shown, regardless of whether or not they appear in the data.
dir: Direction: either “h” for horizontal, the default, or “v” for vertical.
strip.position: By default, the labels are displayed on the top of the plot. Using strip.position it is possible to place the labels on any of the four sides by setting strip.position = c(“top”, “bottom”, “left”, “right”). Use strip.position to display the facet labels at the side of your choice. Setting it to bottom makes it act as a subtitle for the axis. This is typically used with free scales and a theme without boxes around strip labels. Example:
ggplot(economics_long, aes(date, value)) +
geom_line() +
facet_wrap(~variable, scales = "free_y", nrow = 2, strip.position = "bottom") +
theme(strip.background = element_blank(), strip.placement = "outside")
## Warning: Suppressing axis rendering when strip.position = 'bottom' and
## strip.placement == 'outside'
facet_grid() does not have nrow and ncol arguments because the numbers of rows and columns are determined by the unique values of the faceting variables.
To repeat the same data in every panel, simply construct a data frame that does not contain the faceting variable. Example:
ggplot(mpg, aes(displ, hwy)) +
geom_point(data = transform(mpg, class = NULL), colour = "grey85") +
geom_point() +
facet_wrap(~class)
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ cyl)
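In facet_grid(drv ~ cyl) the grid dimensions follow from the data: one row per value of drv and one column per value of cyl. A quick check, with `n_rows` and `n_cols` as illustrative names:

```r
library(ggplot2)

# Rows of the grid come from drv, columns from cyl
n_rows <- length(unique(mpg$drv))
n_cols <- length(unique(mpg$cyl))

n_rows
n_cols
```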
# line
ggplot(data = mpg, aes(x = displ, y = hwy)) +
geom_line()
ggplot(data = mpg, aes(x = displ, y = hwy)) +
geom_smooth()
## `geom_smooth()` using method = 'loess'
# boxplot
ggplot(data = mpg, aes(x = class, y = hwy)) +
geom_boxplot()
# histogram
ggplot(data = mpg, aes(x = displ)) +
geom_histogram(binwidth = 0.2)
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
geom_point() +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess'
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
geom_point() +
geom_smooth(show.legend = FALSE, se = FALSE)
## `geom_smooth()` using method = 'loess'
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
geom_point() +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess'
show.legend = FALSE suppresses the legend for that layer. If show.legend = FALSE is removed, the legend is shown. In the earlier chapter, we wanted the legend to be shown so that it labels the groups.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess'
se = FALSE removes the confidence interval band that is otherwise displayed around the smooth line.
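The band drawn with se = TRUE is stored as ymin and ymax columns in the computed layer data. The sketch below uses the `level` argument of geom_smooth() to request a 90% band; that value, and the explicit `method` and `formula` arguments (added only to silence the message), are choices made for this illustration.

```r
library(ggplot2)

p_band <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x, se = TRUE, level = 0.90)

# Layer 2 is the smooth; ymin and ymax are the edges of the band
smooth_data <- ggplot_build(p_band)$data[[2]]
head(smooth_data[, c("x", "y", "ymin", "ymax")])
```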
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess'
ggplot() +
geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))
## `geom_smooth()` using method = 'loess'
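The two plots look the same, because both layers use the same data and the same mappings. The first version specifies them once in ggplot(); the second repeats them in each geom.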
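The equivalence can also be checked on the computed data, not only by eye. This sketch compares the x and y values of each layer after building both plots; `p_global` and `p_local` are illustrative names, and `method` and `formula` are given explicitly only to avoid the message.

```r
library(ggplot2)

p_global <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x)

p_local <- ggplot() +
  geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy),
              method = "loess", formula = y ~ x)

built_global <- ggplot_build(p_global)$data
built_local <- ggplot_build(p_local)$data

# Compare the point layer (1) and the smooth layer (2)
same_points <- isTRUE(all.equal(built_global[[1]][, c("x", "y")],
                                built_local[[1]][, c("x", "y")]))
same_smooth <- isTRUE(all.equal(built_global[[2]][, c("x", "y")],
                                built_local[[2]][, c("x", "y")]))
same_points
same_smooth
```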
#fig1
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'loess'
#fig2
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth(mapping = aes(group=drv), se=FALSE)
## `geom_smooth()` using method = 'loess'
#fig3
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color=drv)) +
geom_point() +
geom_smooth(mapping = aes(group=drv), se=FALSE)
## `geom_smooth()` using method = 'loess'
#fig4
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping=aes(color= drv)) +
geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'loess'
#fig5
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, group = drv)) +
geom_point(mapping=aes(color = drv)) +
geom_smooth(mapping=aes(linetype = drv), se=FALSE)
## `geom_smooth()` using method = 'loess'
#fig6
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(fill = drv), color="white", size =2, stroke=2, shape = 21)
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_col()
ggplot(data = mpg, mapping = aes(x = displ)) +
geom_bar()
geom_col() uses the values in the data as the heights of the bars, so it needs a y aesthetic. geom_bar() makes the height of each bar proportional to the number of cases in each group, so it counts the rows itself.
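To illustrate the relationship, the counts that geom_bar() computes can be prepared by hand and passed to geom_col(). This is a sketch using the class variable (my choice for the example, not taken from the plots above); `Freq` is the column name produced by as.data.frame() on a table.

```r
library(ggplot2)

# Count the cars in each class by hand
class_counts <- as.data.frame(table(class = mpg$class))

# geom_bar() counts rows itself; geom_col() plots the supplied counts
bar_plot <- ggplot(data = mpg, mapping = aes(x = class)) +
  geom_bar()
col_plot <- ggplot(data = class_counts, mapping = aes(x = class, y = Freq)) +
  geom_col()

bar_plot
col_plot
```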
A data scientist is someone who works with new tools for analyzing data: a person capable of translating the trove of data created by mobile sensors, social media, surveillance, medical imaging, smart grids and the like into predictive insights that lead to business value. Data scientists are comfortable operating with incomplete data. They are more likely to be involved across the whole data lifecycle: acquiring new data sets, parsing them, filtering and organizing them, applying advanced algorithms to solve analytical problems, representing data visually, telling a story with data, and interacting with data dynamically.